Scale command / Z Distribution
The main purpose here is to show how the scale command works on a dataset. In statistics it is often important to transform the data to the Z-scale before analyzing them, standardizing each variable to mean 0 and standard deviation 1. When the data have different magnitudes or units, for instance kilometers, seconds, or temperature, it makes sense to convert them to a common scale, which facilitates the analysis and the use of these data as input to some algorithms. Graphically, I show a randomly generated dataset that is gradually scaled (in steps of 10%). Initially the X axis has values between -5 and 5 and the Y axis has values between -25 and 25. At the end of the scaling procedure, both X and Y values lie between -2 and 2, without losing the proportionality within each axis.
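As a minimal sketch of what scale computes with its default arguments: each column is centered on its mean and divided by its sample standard deviation, that is, z = (x - mean(x)) / sd(x). The vector `km` and the names `manual.z` and `scaled.z` are illustrative only and are not part of the script in this post.

```r
# Illustrative data (hypothetical values, not taken from the post)
km <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Manual z-score: subtract the mean, divide by the sample standard deviation
manual.z <- (km - mean(km)) / sd(km)

# scale() returns a one-column matrix with attributes; as.numeric() drops them
scaled.z <- as.numeric(scale(km))

# Both approaches should agree up to floating-point error
all.equal(manual.z, scaled.z)

# After scaling, the mean is numerically 0 and the standard deviation is 1
mean(scaled.z)
sd(scaled.z)
```

Because the transformation is linear, the ordering and relative spacing of the values within a variable are unchanged; only the origin and the unit change.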
# Load the packages used for plotting and animation
library(ggplot2)
library(plotly)
# Generating a random dataset
x <- runif(1000, -5, 5)
y <- runif(1000, -25, 25)
z <- runif(1000, 1, 2)
# Combining the vectors into a data frame
sim.data <- data.frame(x, y, z)
# Creating a complete data frame with all stages of scaling
sim.data.total <- sim.data
sim.data.total$pscale <- 0
# Steps of 10%
ntry <- 1:10
sim.part <- NULL
for (i in ntry) {
  # Define the block of rows (start and end) to scale in this step
  end.r <- 100 * i
  ini.r <- end.r - 99
  # Scale this block on its own and append it to the blocks already scaled
  sim.part <- data.frame(rbind(sim.part, scale(sim.data[ini.r:end.r, ])))
  if (end.r < nrow(sim.data)) {
    # Partially scaled dataset: scaled blocks followed by the untouched rows
    sim.data.scl <- data.frame(rbind(sim.part, sim.data[(end.r + 1):nrow(sim.data), ]))
  } else {
    # Last step (end.r equals nrow(sim.data)): scale the full dataset at once
    sim.data.scl <- data.frame(scale(sim.data))
  }
  # Append this stage, tagged with the percentage scaled so far
  sim.data.total <- rbind(sim.data.total, data.frame(x = sim.data.scl$x, y = sim.data.scl$y, z = sim.data.scl$z, pscale = i * 10))
}
# Store the ggplot object in the variable p
p <- ggplot(sim.data.total, aes(x = x, y = y, color = z)) +
  geom_point(aes(frame = pscale)) +
  ggtitle("Scaling the Dataset") +
  theme(plot.title = element_text(hjust = -5, vjust = 0))
## Warning: Ignoring unknown aesthetics: frame
# Animate the ggplot object with ggplotly
ggplotly(p) %>%
  animation_opts(1000) %>%
  animation_slider(currentvalue = list(prefix = "Scale ", suffix = "%", font = list(size = 12, color = "red"))) %>%
  config(displayModeBar = FALSE)
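The final range can also be checked numerically. The sketch below rebuilds a dataset with the same construction as the simulated data above and compares the ranges before and after scale. The seed and the names `check.data` and `check.scl` are assumptions added here so the numbers are repeatable. For a uniform variable on [-a, a] the standard deviation is a / sqrt(3), so the scaled values fall within roughly ±sqrt(3) ≈ 1.73 whatever a is, which is why X and Y end up on the same scale.

```r
set.seed(42)  # assumed seed, only to make this sketch repeatable

# Same construction as the simulated dataset above
check.data <- data.frame(x = runif(1000, -5, 5),
                         y = runif(1000, -25, 25),
                         z = runif(1000, 1, 2))

# Scale all columns at once, as in the last step of the loop
check.scl <- data.frame(scale(check.data))

# Range of each column before and after scaling
sapply(check.data, range)
sapply(check.scl, range)

# Every scaled column has mean close to 0 and standard deviation 1
round(colMeans(check.scl), 10)
sapply(check.scl, sd)
```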
Session info
For reproducibility purposes it is always a good idea to capture the state of the environment that was used to generate the results:
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
## [3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
## [5] LC_TIME=Portuguese_Brazil.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.8.0 ggplot2_3.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 later_0.7.3 pillar_1.3.0
## [4] compiler_3.4.4 plyr_1.8.4 bindr_0.1.1
## [7] tools_3.4.4 digest_0.6.15 viridisLite_0.3.0
## [10] jsonlite_1.5 evaluate_0.11 tibble_1.4.2
## [13] gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1
## [16] shiny_1.1.0 crosstalk_1.0.0 yaml_2.1.19
## [19] bindrcpp_0.2.2 withr_2.1.2 dplyr_0.7.6
## [22] stringr_1.3.1 httr_1.3.1 knitr_1.20
## [25] htmlwidgets_1.2 rprojroot_1.3-2 grid_3.4.4
## [28] tidyselect_0.2.4 glue_1.2.0 data.table_1.11.4
## [31] R6_2.2.2 rmarkdown_1.10 tidyr_0.8.1
## [34] purrr_0.2.5 magrittr_1.5 promises_1.0.1
## [37] backports_1.1.2 scales_0.5.0 htmltools_0.3.6
## [40] assertthat_0.2.0 xtable_1.8-2 mime_0.5
## [43] colorspace_1.3-2 httpuv_1.4.4.2 labeling_0.3
## [46] stringi_1.1.7 lazyeval_0.2.1 munsell_0.5.0
## [49] crayon_1.3.4